In this tutorial we’ll learn how to merge datasets and fit linear models in R.

merge datasets

We’re working with the Capital Bikeshare data again this week, so start by reading in usage, weather, and stations.
library(dplyr)
library(ggplot2)
library(lubridate)
usage = read.delim('usage_2012.tsv',
                   sep = '\t',
                   header = TRUE)
weather = read.delim('daily_weather.tsv',
                     sep = '\t',
                     header = TRUE)
stations = read.delim('stations.tsv',
                      sep = '\t',
                      header = TRUE)
We have three related datasets to work with, but we can’t really get started until they’re combined. Let’s start with usage and weather. The usage dataframe is at the resolution of an hour, while the weather data are at the resolution of a day, so we know we’re going to have to either duplicate or compress data to merge. I vote compress: let’s summarize!
head(usage)
## bike_id time_start time_end duration_mins
## 1 W01412 2012-01-01 00:04:00 2012-01-01 00:11:00 7
## 2 W00524 2012-01-01 00:10:00 2012-01-01 00:29:00 19
## 3 W00235 2012-01-01 00:10:00 2012-01-01 00:29:00 19
## 4 W00864 2012-01-01 00:15:00 2012-01-01 00:23:00 8
## 5 W00995 2012-01-01 00:15:00 2012-01-01 00:23:00 8
## 6 W00466 2012-01-01 00:17:00 2012-01-01 00:23:00 6
## station_start station_end cust_type
## 1 7th & R St NW / Shaw Library 7th & T St NW Registered
## 2 Georgia & New Hampshire Ave NW 16th & Harvard St NW Casual
## 3 Georgia & New Hampshire Ave NW 16th & Harvard St NW Registered
## 4 14th & V St NW Park Rd & Holmead Pl NW Registered
## 5 11th & Kenyon St NW 7th & T St NW Registered
## 6 Court House Metro / 15th & N Uhle St Lynn & 19th St North Registered
custs_per_day =
  usage %>%
  group_by(time_start = as.Date(time_start), station_start, cust_type) %>%
  summarize(no_rentals = n(),
            duration_mins = mean(duration_mins, na.rm = TRUE))
head(custs_per_day)
## Source: local data frame [6 x 5]
## Groups: time_start, station_start
##
## time_start station_start cust_type no_rentals
## 1 2012-01-01 10th & Monroe St NE Registered 10
## 2 2012-01-01 10th & U St NW Casual 8
## 3 2012-01-01 10th & U St NW Registered 50
## 4 2012-01-01 10th St & Constitution Ave NW Casual 34
## 5 2012-01-01 10th St & Constitution Ave NW Registered 20
## 6 2012-01-01 11th & H St NE Casual 4
## Variables not shown: duration_mins (dbl)
Perfection, now we can merge! What’s the key?
# make sure we have consistent date formats
custs_per_day$time_start = ymd(custs_per_day$time_start)
weather$date = ymd(weather$date)
# then merge. see ?merge for more details about the function
weather_rentals = merge(custs_per_day, weather,
                        by.x = 'time_start', by.y = 'date')
# check dimensions after to make sure they are what you expect
dim(custs_per_day)
## [1] 57376 3
dim(weather)
## [1] 366 15
dim(weather_rentals)
## [1] 57376 17
head(weather_rentals)
## time_start station_start no_rentals weekday season_code
## 1 2012-01-01 10th & Monroe St NE 10 0 1
## 2 2012-01-01 10th & U St NW 58 0 1
## 3 2012-01-01 10th St & Constitution Ave NW 54 0 1
## 4 2012-01-01 11th & H St NE 20 0 1
## 5 2012-01-01 11th & Kenyon St NW 58 0 1
## 6 2012-01-01 12th & Army Navy Dr 12 0 1
## season_desc is_holiday is_work_day weather_code
## 1 Spring 0 0 1
## 2 Spring 0 0 1
## 3 Spring 0 0 1
## 4 Spring 0 0 1
## 5 Spring 0 0 1
## 6 Spring 0 0 1
## weather_desc temp subjective_temp
## 1 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## 2 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## 3 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## 4 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## 5 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## 6 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37 0.375621
## humidity windspeed no_casual_riders no_reg_riders total_riders
## 1 0.6925 0.192167 686 1608 2294
## 2 0.6925 0.192167 686 1608 2294
## 3 0.6925 0.192167 686 1608 2294
## 4 0.6925 0.192167 686 1608 2294
## 5 0.6925 0.192167 686 1608 2294
## 6 0.6925 0.192167 686 1608 2294
Great, now we want to merge on the last dataset, stations. What is the key to link weather_rentals with stations?
final_data = merge(weather_rentals, stations,
                   by.x = 'station_start', by.y = 'station')
dim(final_data)
## [1] 57011 152
dim(weather_rentals)
## [1] 57376 17
head(final_data[, 1:30])
## station_start time_start no_rentals weekday season_code season_desc
## 1 10th & E St NW 2012-12-03 40 1 4 Winter
## 2 10th & E St NW 2012-09-25 57 2 4 Winter
## 3 10th & E St NW 2012-11-10 49 6 4 Winter
## 4 10th & E St NW 2012-11-12 29 1 4 Winter
## 5 10th & E St NW 2012-12-05 39 3 4 Winter
## 6 10th & E St NW 2012-11-30 31 5 4 Winter
## is_holiday is_work_day weather_code
## 1 0 1 1
## 2 0 1 1
## 3 0 0 1
## 4 1 0 1
## 5 0 1 1
## 6 0 1 1
## weather_desc temp subjective_temp
## 1 Clear, Few clouds, Partly cloudy, Partly cloudy 0.452500 0.455796
## 2 Clear, Few clouds, Partly cloudy, Partly cloudy 0.550000 0.544179
## 3 Clear, Few clouds, Partly cloudy, Partly cloudy 0.389167 0.393937
## 4 Clear, Few clouds, Partly cloudy, Partly cloudy 0.485000 0.475383
## 5 Clear, Few clouds, Partly cloudy, Partly cloudy 0.438333 0.428012
## 6 Clear, Few clouds, Partly cloudy, Partly cloudy 0.298333 0.323867
## humidity windspeed no_casual_riders no_reg_riders total_riders id
## 1 0.767500 0.0827208 555 5679 6234 199
## 2 0.570000 0.2363210 845 6693 7538 199
## 3 0.645417 0.0578458 2090 4446 6536 199
## 4 0.741667 0.1735170 1097 5172 6269 199
## 5 0.485000 0.3240210 331 5398 5729 199
## 6 0.649583 0.0584708 362 5306 5668 199
## terminal_name lat long no_bikes no_empty_docks fast_food
## 1 31256 38.89591 -77.02606 6 8 5
## 2 31256 38.89591 -77.02606 6 8 5
## 3 31256 38.89591 -77.02606 6 8 5
## 4 31256 38.89591 -77.02606 6 8 5
## 5 31256 38.89591 -77.02606 6 8 5
## 6 31256 38.89591 -77.02606 6 8 5
## parking restaurant convenience post_office bicycle_parking
## 1 2 16 0 1 4
## 2 2 16 0 1 4
## 3 2 16 0 1 4
## 4 2 16 0 1 4
## 5 2 16 0 1 4
## 6 2 16 0 1 4
## drinking_water
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# probably want to save this now!
write.table(final_data,
            'bikeshare_modeling_data.tsv',
            row.names = FALSE, sep = '\t')
# rename to something more convenient and remove from memory
data = final_data
rm(final_data)
lm() function

The function for creating a linear model in R is lm(), and its primary arguments are formula and data. Formulas in R are a little funny: instead of an = sign, they are expressed with a ~. Let’s fit the model we saw in the lecture notes: \(rentals = \beta_0 + \beta_1*crossing\). There’s a little snag we have to take care of first. Right now we’ve got repeated measures, i.e. one measurement per station per day, so we need to aggregate again, this time over date.
rentals_crossing =
  data %>%
  group_by(station_start) %>%
  summarize(mean_rentals = mean(no_rentals),
            crossing = mean(crossing))
head(rentals_crossing)
## Source: local data frame [6 x 3]
##
## station_start mean_rentals crossing
## 1 10th & E St NW 36.79070 122
## 2 10th & Monroe St NE 10.59167 1
## 3 10th & U St NW 71.24317 5
## 4 10th St & Constitution Ave NW 55.92603 116
## 5 11th & H St NE 35.18579 73
## 6 11th & Kenyon St NW 61.81694 20
# plot it
ggplot(rentals_crossing, aes(x = crossing, y = mean_rentals)) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()
model = lm(mean_rentals ~ crossing, data = rentals_crossing)
# view what is returned in the lm object
attributes(model)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
# get model output
summary(model)
##
## Call:
## lm(formula = mean_rentals ~ crossing, data = rentals_crossing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.815 -21.176 -8.394 13.765 126.969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.53163 2.59354 10.615 < 2e-16 ***
## crossing 0.49159 0.07031 6.992 4.92e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.54 on 183 degrees of freedom
## Multiple R-squared: 0.2108, Adjusted R-squared: 0.2065
## F-statistic: 48.88 on 1 and 183 DF, p-value: 4.92e-11
# print model diagnostics
par(mfrow = c(2, 2))
plot(model)
The attributes() function can be called on just about any object in R, and it returns a list of all the things inside. It’s a great way to explore objects and see what values are contained inside that could be used in other analyses. For example, extracting the residuals via model$residuals is useful if we want to print diagnostic plots like those above.
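For instance, here’s a minimal sketch (using R’s built-in mtcars data rather than the bikeshare data) of pulling residuals and fitted values out of an lm object by name and drawing a residuals-vs-fitted plot by hand:

```r
# sketch: extract pieces of an lm object by name (built-in mtcars data)
fit = lm(mpg ~ wt, data = mtcars)

# the same residuals and fitted values that plot(fit) uses
res = fit$residuals
fits = fit$fitted.values

# a manual residuals-vs-fitted plot, like the first panel of plot(fit)
plot(fits, res,
     xlab = 'fitted values', ylab = 'residuals',
     main = 'residuals vs fitted')
abline(h = 0, lty = 2)
```

Note that fitted values plus residuals reconstruct the observed outcome exactly.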
When we run summary() on the lm object, we see the results. The Call section just prints back the model specification, and the Residuals section contains a summary of the distribution of the errors. The fun stuff is in the Coefficients section. Each row contains a covariate name followed by its estimate, standard error, t-value, and p-value. Our model ends up being rentals = 27.5 + 0.49(crosswalks), which means that the average number of rentals when there are no crosswalks is about 27.5, and the average increases by about 1 rental for every two additional crosswalks.
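To see how the coefficients turn into predictions, here’s a small sketch on R’s built-in cars data: a manual prediction built from coef() matches what predict() returns.

```r
# sketch: turning lm coefficients into predictions by hand (built-in cars data)
fit = lm(dist ~ speed, data = cars)
b = coef(fit)  # intercept and slope

# manual prediction for a speed of 15 mph
manual = b[1] + b[2] * 15
# should match predict() on the same new data
auto = predict(fit, newdata = data.frame(speed = 15))
all.equal(unname(manual), unname(auto))  # TRUE
```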
We can fit regressions with multiple covariates the same way:
# let's include windspeed this time
rentals_multi =
  data %>%
  group_by(station_start) %>%
  summarize(mean_rentals = mean(no_rentals),
            crossing = mean(crossing),
            windspeed = mean(windspeed))
head(rentals_multi)
## Source: local data frame [6 x 4]
##
## station_start mean_rentals crossing windspeed
## 1 10th & E St NW 36.79070 122 0.1733099
## 2 10th & Monroe St NE 10.59167 1 0.1885038
## 3 10th & U St NW 71.24317 5 0.1895723
## 4 10th St & Constitution Ave NW 55.92603 116 0.1891103
## 5 11th & H St NE 35.18579 73 0.1895723
## 6 11th & Kenyon St NW 61.81694 20 0.1895723
ggplot(rentals_multi, aes(x = windspeed, y = mean_rentals)) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()
model = lm(mean_rentals ~ crossing + windspeed, data = rentals_multi)
summary(model)
##
## Call:
## lm(formula = mean_rentals ~ crossing + windspeed, data = rentals_multi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.528 -19.319 -5.397 11.865 121.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.00979 69.62422 -5.027 1.19e-06 ***
## crossing 0.45968 0.06568 6.999 4.78e-11 ***
## windspeed 2036.29096 375.29657 5.426 1.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.48 on 182 degrees of freedom
## Multiple R-squared: 0.3207, Adjusted R-squared: 0.3132
## F-statistic: 42.96 on 2 and 182 DF, p-value: 5.226e-16
The model coefficients changed quite a lot when we added in wind speed. The intercept is now negative, and the wind speed coefficient is huge! When interpreting coefficients, it’s important to keep the scale of the predictor in mind. Wind speed only ranges from about 0.05 to 0.44, so when you multiply the coefficient of about 2036 by 0.05, for example, you end up with about 100, which is within the range we’d expect.
Let’s try one more, this time we’ll include a factor variable.
rentals_multi =
  data %>%
  group_by(station_start, is_holiday) %>%
  summarize(mean_rentals = mean(no_rentals),
            crossing = mean(crossing),
            windspeed = mean(windspeed))
head(rentals_multi)
## Source: local data frame [6 x 5]
## Groups: station_start
##
## station_start is_holiday mean_rentals crossing windspeed
## 1 10th & E St NW 0 36.994012 122 0.1740921
## 2 10th & E St NW 1 30.000000 122 0.1471828
## 3 10th & Monroe St NE 0 10.681948 1 0.1883067
## 4 10th & Monroe St NE 1 7.727273 1 0.1947563
## 5 10th & U St NW 0 71.684507 5 0.1894117
## 6 10th & U St NW 1 57.000000 5 0.1947563
# plot crossings, colored by is_holiday
ggplot(rentals_multi,
       aes(x = crossing, y = mean_rentals, color = factor(is_holiday))) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()
# plot windspeed, colored by is_holiday
ggplot(rentals_multi,
       aes(x = windspeed, y = mean_rentals, color = factor(is_holiday))) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()
model = lm(mean_rentals ~ crossing + windspeed + factor(is_holiday),
           data = rentals_multi)
summary(model)
##
## Call:
## lm(formula = mean_rentals ~ crossing + windspeed + factor(is_holiday),
## data = rentals_multi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.310 -19.253 -6.486 12.377 131.761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.54159 18.54479 -2.779 0.00573 **
## crossing 0.40017 0.04537 8.819 < 2e-16 ***
## windspeed 436.86213 99.43162 4.394 1.46e-05 ***
## factor(is_holiday)1 -7.52366 2.80516 -2.682 0.00765 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.9 on 366 degrees of freedom
## Multiple R-squared: 0.2335, Adjusted R-squared: 0.2272
## F-statistic: 37.17 on 3 and 366 DF, p-value: < 2.2e-16
The output looks a little funny now. There’s a term called factor(is_holiday)1, what does that mean? Factors are categorical variables, and their interpretation is relative to a baseline. Our factor is_holiday only has two levels, 0 and 1, and R sets 0 as the baseline by default. So the interpretation of that term is that we can expect about 7.5 fewer rentals when it is a holiday (i.e. is_holiday == 1) and the other variables are held fixed.
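Here’s a small sketch (with a toy factor, not the bikeshare data) showing how R dummy-codes a two-level factor and how relevel() changes which level is the baseline:

```r
# sketch: how R dummy-codes a two-level factor, with level 0 as baseline
f = factor(c(0, 1, 0, 1))
model.matrix(~ f)   # column 'f1' is 1 only for the non-baseline level

# relevel() changes which level is the baseline
f2 = relevel(f, ref = '1')
model.matrix(~ f2)  # now column 'f20' flags the 0 level instead
```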
For this section, we’ll use the fully cleaned and combined data from the project-1-data-cleanup file, so make sure you’ve gone through and cleaned your data up like that, or download the clean file from here.
data = read.delim('final_modeling_data.tsv', sep = '\t', header = TRUE)
We’ll be using the caret package (short for classification and regression training) for model development because it integrates many modeling packages in R into one unified syntax. That means more reusable code for us! caret contains helper functions that provide a unified framework for data cleaning/splitting, model training, and comparison. I highly recommend the optional reading this week, which provides a great overview of the caret package.
install.packages('caret', dependencies = TRUE)
library(caret)
set.seed(1234) # set a seed
Setting a seed in R ensures that you get identical results each time you run your code. Since resampling methods are inherently probabilistic, every time we rerun them we’ll get slightly different answers. Setting the seed to the same number ensures that we get identical randomness each time the code is run, and that’s helpful for debugging.
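A quick sketch of that behavior:

```r
# sketch: identical random draws after resetting the same seed
set.seed(1234)
a = sample(1:100, 5)
set.seed(1234)
b = sample(1:100, 5)
identical(a, b)  # TRUE: same seed, same "randomness"
```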
Before any analysis in this class we’ll need to divide our data into train and test sets. Check out this nice overview for more details. The training set is typically about 75% of the data and is used for all the model development. Once we have a model we’re satisfied with, we use our testing set, the other 25%, to generate model predictions. Splitting the data into these two groups lets us distinguish two types of errors: in-sample and out-of-sample. In-sample errors are measured on the same data the model was built with. Out-of-sample errors are measured on a fresh dataset. We are interested in the out-of-sample error because this quantity represents how we’d expect the model to perform in the future on brand new data.
Here’s how to split the data with caret:
# select the training observations
in_train = createDataPartition(y = data$rentals,
                               p = 0.75, # 75% in train, 25% in test
                               list = FALSE)
head(in_train) # row indices of observations in the training set
## Resample1
## [1,] 7
## [2,] 12
## [3,] 25
## [4,] 72
## [5,] 92
## [6,] 93
train = data[in_train, ]
test = data[-in_train, ]
dim(train)
## [1] 17544 119
dim(test)
## [1] 5846 119
Note: I recommend doing all data processing and aggregation steps before splitting out your train/test sets.
Our workhorse function in the caret package is the train() function. This function can be used to evaluate performance parameters, choose optimal models based on the values of those parameters, and estimate model performance. For regression we can use it in place of the lm() function. Here’s our last regression model using the train function.
model_fit = train(rentals ~ crossing + windspeed + factor(is_holiday),
                  data = train,
                  method = 'lm',
                  metric = 'RMSE')
print(model_fit)
## Linear Regression
##
## 17544 samples
## 118 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 17544, 17544, 17544, 17544, 17544, 17544, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 26.67981 0.06043381 0.3414677 0.004150964
##
##
summary(model_fit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.402 -14.852 -10.088 6.122 235.311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.909526 0.725825 23.297 < 2e-16 ***
## crossing 0.213221 0.006357 33.543 < 2e-16 ***
## windspeed -0.901991 3.628303 -0.249 0.804
## `factor(is_holiday)1` -4.664848 0.681877 -6.841 8.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.71 on 17540 degrees of freedom
## Multiple R-squared: 0.0622, Adjusted R-squared: 0.06204
## F-statistic: 387.8 on 3 and 17540 DF, p-value: < 2.2e-16
# get predictions
out_of_sample_predictions = predict(model_fit, newdata = test)
# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
                    observed = test$rentals,
                    error = out_of_sample_predictions - test$rentals)
# eh, not so good
ggplot(data = errors, aes(x = predicted, y = observed)) +
  geom_abline(aes(intercept = 0, slope = 1),
              size = 3, alpha = 0.70, color = 'red') +
  geom_point(size = 3, alpha = 0.80) +
  ggtitle('out-of-sample errors') +
  theme_minimal()
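It’s also handy to boil an error column like the one above down to a single number. RMSE is just the square root of the mean squared error; a toy sketch with made-up numbers:

```r
# sketch: RMSE from observed vs predicted values (toy numbers)
observed  = c(30, 45, 52, 18)
predicted = c(26, 50, 40, 25)
rmse = sqrt(mean((predicted - observed)^2))
rmse
```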
Our prediction accuracy is not so great for this model. The in-sample RMSE is about 27, which means that on average the predictions are off by about 27 rentals. Let’s fit the giant model we made before:
full_model = train(rentals ~ .,
                   data = train,
                   method = 'lm')
The in-sample RMSE is about 19, so definitely an improvement over the previous model, but this model is really complex and probably not going to be usable by Pronto. How can we reduce the complexity of the model, but maintain reasonable predictive accuracy?
Shrinkage methods require that the predictors are normalized to be on the same scale. We can accomplish this by centering and scaling the data. You center a variable by subtracting the mean of the variable from each observation. To scale, you then divide the centered observations by the variable’s standard deviation. The resulting variable has mean = 0 and standard deviation = 1.
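A quick sketch of centering and scaling by hand (toy numbers), checked against R’s scale():

```r
# sketch: centering and scaling by hand vs with scale()
x = c(2, 4, 6, 8, 10)
manual = (x - mean(x)) / sd(x)
auto = as.numeric(scale(x))
all.equal(manual, auto)                   # TRUE
c(mean = mean(manual), sd = sd(manual))   # ~0 and 1
```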
The caret package has lots of convenient functions for preprocessing data, check ’em out!
We run into some trouble if we try to just center and scale the data because it’s got factor variables, and you can’t subtract a number from a category. We can use the model.matrix() function to fix that really quickly.
no_factors = as.data.frame(model.matrix(rentals ~ . -1, data = data))
# put rentals back on
no_factors$rentals = data$rentals
full_model_scaled = train(rentals ~ .,
                          data = no_factors,
                          method = 'lm',
                          preProcess = c('center', 'scale'))
Coefficients estimated with normalized data have a different interpretation than coefficients from un-normalized data. In this case, when the data are scaled, the intercept has a better interpretation: it’s the expected number of rentals when all the predictors are at their average values. So, in this case, when all the predictors are at their average values, we expect about 21 rentals per day. In the previous full model we had an intercept of about -28, which could be interpreted as the expected number of rentals when all the other predictors have a value of 0. That’s pretty unsatisfying for a couple of reasons. First, we can’t have negative rentals! Second, for a lot of the predictors it doesn’t make sense to plug in 0’s. What does it mean to have a duration of 0? Or a temp of 0? Centering and scaling fix the non-interpretability of the previous models.
Since we divide by the standard deviation during scaling, the non-intercept coefficients in the centered and scaled model can be interpreted as the increase in \(y\) associated with a 1 standard deviation increase in \(x\).
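A small sketch on R’s built-in cars data confirms that relationship: the coefficient fit on a scaled predictor equals the raw coefficient times the predictor’s standard deviation.

```r
# sketch: scaled-predictor coefficient = raw coefficient * sd(x)
fit_raw    = lm(dist ~ speed, data = cars)
fit_scaled = lm(dist ~ scale(speed), data = cars)

b_raw    = coef(fit_raw)['speed']
b_scaled = coef(fit_scaled)[2]
all.equal(unname(b_scaled), unname(b_raw * sd(cars$speed)))  # TRUE
```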
A simple method to reduce model complexity is to combine some of the variables. For example, the dataset contains separate variables for alcohol, pub, and bar; likewise, there are variables for food_court, restaurant, food_cart, and fast_food. Maybe we can retain most of the information while removing some variables.
no_factors$food = no_factors$fast_food + no_factors$restaurant +
  no_factors$food_court + no_factors$bar.restaurant +
  no_factors$cafe + no_factors$food_cart
no_factors$nightlife = no_factors$bar + no_factors$club +
  no_factors$pub + no_factors$nightclub
no_factors$seedy_stuff = no_factors$stripclub + no_factors$strip_club +
  no_factors$alcohol + no_factors$check_cashing + no_factors$motel +
  no_factors$hostel
no_factors$tourism = no_factors$theatre + no_factors$arts_centre +
  no_factors$tourist + no_factors$school..historic. + no_factors$hotel +
  no_factors$gallery + no_factors$artwork + no_factors$sculpture +
  no_factors$museum + no_factors$tour_guide + no_factors$car_rental +
  no_factors$guest_house + no_factors$landmark + no_factors$attraction +
  no_factors$information
dim(no_factors)
## [1] 23390 133
# now remove those variables from the no_factors dataset
no_factors =
  no_factors %>%
  select(-fast_food, -restaurant, -food_court, -bar.restaurant, -cafe,
         -food_cart, -bar, -club, -pub, -nightclub, -stripclub, -strip_club,
         -alcohol, -check_cashing, -motel, -hostel, -theatre, -arts_centre,
         -tourist, -school..historic., -hotel, -gallery, -artwork, -sculpture,
         -museum, -tour_guide, -car_rental, -guest_house, -landmark,
         -attraction, -information)
# Reduced the dataset by 31 variables!
dim(no_factors)
## [1] 23390 102
Try out your own categories, these are just a few to get you started. We’ll learn how to make categories computationally when we cover clustering.
We’ve changed the dataframe, so don’t forget to redefine the train and test sets!
train = no_factors[in_train, ]
test = no_factors[-in_train, ]
dim(train)
## [1] 17544 102
dim(test)
## [1] 5846 102
# how does our new full-model compare?
full_model = train(rentals ~ .,
data = train,
method = 'lm')
We haven’t talked much about computational limitations yet, but it’s a good time to start. Selection methods can be extremely slow. Why? Because we have \(2^p = 2^{117}\) possible variable combinations. I recommend doing some combining before trying these methods. I’ll leave the combining up to you, but to make sure these models can run in less than infinite time, I’m going to remove a bunch of predictors so you get the idea.
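To get a feel for that growth, here’s a quick sketch of how the count of candidate models explodes with p:

```r
# sketch: number of possible variable subsets for p predictors
p = c(5, 10, 20, 117)
data.frame(p = p, models = 2^p)
```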
train =
  train %>%
  select(rentals, cust_typeCasual, cust_typeRegistered, cust_typeSubscriber,
         weekday1, weekday2, weekday3, weekday4, weekday5, weekday6,
         season_code2, season_code3, season_code4, is_holiday, weather_code2,
         weather_code3, humidity, windspeed, temp, duration, food, nightlife,
         seedy_stuff, tourism)
test =
  test %>%
  select(rentals, cust_typeCasual, cust_typeRegistered, cust_typeSubscriber,
         weekday1, weekday2, weekday3, weekday4, weekday5, weekday6,
         season_code2, season_code3, season_code4, is_holiday, weather_code2,
         weather_code3, humidity, windspeed, temp, duration, food, nightlife,
         seedy_stuff, tourism)
# forward selection
forward_model = train(rentals ~ .,
                      data = na.omit(train),
                      method = 'leapForward',
                      preProcess = c('center', 'scale'),
                      # try models of size 1 - 23
                      tuneGrid = expand.grid(nvmax = 1:23),
                      trControl = trainControl(method = 'cv', number = 5))
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
# what does this return?
attributes(forward_model)
## $names
## [1] "method" "modelInfo" "modelType" "results"
## [5] "pred" "bestTune" "call" "dots"
## [9] "metric" "control" "finalModel" "preProcess"
## [13] "trainingData" "resample" "resampledCM" "perfNames"
## [17] "maximize" "yLimits" "times" "terms"
## [21] "coefnames" "xlevels"
##
## $class
## [1] "train" "train.formula"
# what should the number of variables, k, be?
forward_model$bestTune
## nvmax
## 22 22
# what metric was used?
forward_model$metric
## [1] "RMSE"
# here's a handful of other useful plots and summaries
print(forward_model)
## Linear Regression with Forward Selection
##
## 17544 samples
## 23 predictor
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 14036, 14037, 14034, 14034, 14035
##
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared RMSE SD Rsquared SD
## 1 25.34630 0.1558734 0.5171403 0.01993912
## 2 24.93580 0.1829453 0.5922634 0.02108161
## 3 24.90329 0.1850447 0.6108495 0.02084754
## 4 24.89912 0.1852932 0.6077094 0.02095107
## 5 24.86277 0.1876819 0.6073913 0.02100862
## 6 24.85255 0.1883723 0.6193269 0.02130686
## 7 24.22592 0.2285633 0.5668672 0.02162475
## 8 24.22682 0.2285179 0.5676854 0.02186734
## 9 24.35505 0.2213029 0.7227301 0.02442361
## 10 24.36359 0.2210779 0.7354085 0.02497915
## 11 23.70296 0.2626084 1.2645873 0.06460206
## 12 23.51564 0.2744719 1.0021704 0.05322519
## 13 23.25240 0.2905890 0.7347665 0.02825663
## 14 23.18618 0.2942423 0.6875367 0.02641883
## 15 23.15332 0.2962349 0.6786997 0.02776622
## 16 23.10479 0.2990525 0.6630661 0.02765034
## 17 23.09873 0.2995969 0.6716214 0.02702222
## 18 23.09999 0.2995307 0.6732473 0.02707303
## 19 23.09414 0.2998882 0.6786158 0.02728513
## 20 23.09389 0.2999094 0.6844024 0.02764143
## 21 23.07183 0.3012558 0.6766023 0.02730813
## 22 23.03024 0.3036193 0.6746600 0.02760871
## 23 23.03024 0.3036193 0.6746600 0.02760871
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 22.
summary(forward_model)
## Subset selection object
## 23 Variables (and intercept)
## Forced in Forced out
## cust_typeCasual FALSE FALSE
## cust_typeRegistered FALSE FALSE
## weekday1 FALSE FALSE
## weekday2 FALSE FALSE
## weekday3 FALSE FALSE
## weekday4 FALSE FALSE
## weekday5 FALSE FALSE
## weekday6 FALSE FALSE
## season_code2 FALSE FALSE
## season_code3 FALSE FALSE
## season_code4 FALSE FALSE
## is_holiday FALSE FALSE
## weather_code2 FALSE FALSE
## weather_code3 FALSE FALSE
## humidity FALSE FALSE
## windspeed FALSE FALSE
## temp FALSE FALSE
## duration FALSE FALSE
## food FALSE FALSE
## nightlife FALSE FALSE
## seedy_stuff FALSE FALSE
## tourism FALSE FALSE
## cust_typeSubscriber FALSE FALSE
## 1 subsets of each size up to 22
## Selection Algorithm: forward
## cust_typeCasual cust_typeRegistered cust_typeSubscriber weekday1
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " "
## 9 ( 1 ) "*" "*" " " " "
## 10 ( 1 ) "*" "*" " " " "
## 11 ( 1 ) "*" "*" " " " "
## 12 ( 1 ) "*" "*" " " " "
## 13 ( 1 ) "*" "*" " " " "
## 14 ( 1 ) "*" "*" " " " "
## 15 ( 1 ) "*" "*" " " " "
## 16 ( 1 ) "*" "*" " " " "
## 17 ( 1 ) "*" "*" " " " "
## 18 ( 1 ) "*" "*" " " "*"
## 19 ( 1 ) "*" "*" " " "*"
## 20 ( 1 ) "*" "*" " " "*"
## 21 ( 1 ) "*" "*" " " "*"
## 22 ( 1 ) "*" "*" " " "*"
## weekday2 weekday3 weekday4 weekday5 weekday6 season_code2
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " "*"
## 11 ( 1 ) " " " " " " " " " " "*"
## 12 ( 1 ) " " " " " " " " " " "*"
## 13 ( 1 ) " " " " " " " " " " "*"
## 14 ( 1 ) " " " " " " " " " " "*"
## 15 ( 1 ) " " " " " " "*" " " "*"
## 16 ( 1 ) " " " " " " "*" "*" "*"
## 17 ( 1 ) " " "*" " " "*" "*" "*"
## 18 ( 1 ) " " "*" " " "*" "*" "*"
## 19 ( 1 ) " " "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*" "*"
## season_code3 season_code4 is_holiday weather_code2 weather_code3
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " " " "*"
## 7 ( 1 ) "*" " " " " " " "*"
## 8 ( 1 ) "*" " " "*" " " "*"
## 9 ( 1 ) "*" " " "*" " " "*"
## 10 ( 1 ) "*" " " "*" " " "*"
## 11 ( 1 ) "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" " " "*"
## 13 ( 1 ) "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*"
## humidity windspeed temp duration food nightlife seedy_stuff
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " " " "*" " "
## 5 ( 1 ) " " " " " " " " " " "*" " "
## 6 ( 1 ) " " " " " " " " " " "*" " "
## 7 ( 1 ) " " " " " " " " "*" "*" " "
## 8 ( 1 ) " " " " " " " " "*" "*" " "
## 9 ( 1 ) " " " " "*" " " "*" "*" " "
## 10 ( 1 ) " " " " "*" " " "*" "*" " "
## 11 ( 1 ) " " " " "*" " " "*" "*" " "
## 12 ( 1 ) " " " " "*" " " "*" "*" "*"
## 13 ( 1 ) " " " " "*" " " "*" "*" "*"
## 14 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 15 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 16 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 17 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 18 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 19 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 20 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## tourism
## 1 ( 1 ) " "
## 2 ( 1 ) "*"
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
## 6 ( 1 ) "*"
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
## 20 ( 1 ) "*"
## 21 ( 1 ) "*"
## 22 ( 1 ) "*"
plot(forward_model)
plot(varImp(forward_model))
# compare all the models
plot(forward_model$finalModel, scale = 'adjr2')
# backward selection
backward_model = train(rentals ~ .,
                       data = na.omit(train),
                       method = 'leapBackward',
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(nvmax = 1:23),
                       trControl = trainControl(method = 'cv', number = 5))
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
plot(backward_model)
plot(backward_model$finalModel, scale = 'adjr2')
plot(varImp(backward_model, scale = TRUE))
# steps in both directions
hybrid_model = train(rentals ~ .,
                     data = na.omit(train),
                     method = 'leapSeq',
                     preProcess = c('center', 'scale'),
                     tuneGrid = expand.grid(nvmax = 1:23),
                     trControl = trainControl(method = 'cv', number = 5))
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
plot(hybrid_model)
plot(hybrid_model$finalModel, scale = 'adjr2')
plot(varImp(hybrid_model))
# ridge regression
ridge_model = train(rentals ~ .,
                    data = train,
                    method = 'ridge',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    # reducing the cv folds for speed
                    trControl = trainControl(method = 'cv', number = 5))
print(ridge_model)
## Ridge Regression
##
## 17544 samples
## 23 predictor
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 14035, 14034, 14036, 14035, 14036
##
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0.0000000000 22.91761 0.3096930 0.1665859 0.01280914
## 0.0001000000 22.91760 0.3096934 0.1665527 0.01281036
## 0.0002371374 22.91760 0.3096939 0.1665074 0.01281202
## 0.0005623413 22.91759 0.3096949 0.1664011 0.01281595
## 0.0013335214 22.91758 0.3096965 0.1661550 0.01282512
## 0.0031622777 22.91763 0.3096956 0.1656028 0.01284618
## 0.0074989421 22.91809 0.3096730 0.1644485 0.01289259
## 0.0177827941 22.92037 0.3095484 0.1623692 0.01298722
## 0.0421696503 22.92825 0.3091161 0.1596038 0.01315668
## 0.1000000000 22.94861 0.3081006 0.1582659 0.01340910
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.001333521.
plot(ridge_model)
plot(ridge_model$finalModel)
plot(varImp(ridge_model))
# get the coefficients for the model
# NOTE: shrinkage methods fit to centered data, so no intercept term is reported
ridge_coefs = predict(ridge_model$finalModel, type = 'coef', mode = 'norm')$coefficients
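`ridge_coefs` is a matrix with one row per step along the regularization path, not a single coefficient vector. To look at the least-shrunk solution, we can take the last row (a sketch, assuming `ridge_coefs` from the line above; the `final_coefs` name is ours):

```r
# each row of ridge_coefs is one step along the regularization path;
# the last row is the least-shrunk solution
final_coefs = ridge_coefs[nrow(ridge_coefs), ]

# look at the largest coefficients in absolute value
head(sort(abs(final_coefs), decreasing = TRUE))
```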
# ridge regression with variable selection
ridge_model2 = train(rentals ~ .,
data = train,
method = 'foba',
preProcess = c('center', 'scale'),
tuneLength = 10,
trControl = trainControl(method = 'cv', number = 5))
print(ridge_model2)
## Ridge Regression with Variable Selection
##
## 17544 samples
## 23 predictor
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 14033, 14036, 14036, 14035, 14036
##
## Resampling results across tuning parameters:
##
## lambda k RMSE Rsquared RMSE SD Rsquared SD
## 1.000000e-05 2 24.16441 0.2321921 0.2759654 0.01442891
## 1.000000e-05 4 23.22468 0.2907670 0.3494591 0.01535095
## 1.000000e-05 6 23.08243 0.2994412 0.3349153 0.01571250
## 1.000000e-05 9 23.04479 0.3017099 0.3377533 0.01598847
## 1.000000e-05 11 22.99470 0.3047225 0.3331931 0.01467418
## 1.000000e-05 13 22.96824 0.3063260 0.3276297 0.01504069
## 1.000000e-05 16 22.95721 0.3069987 0.3194928 0.01454891
## 1.000000e-05 18 22.93837 0.3081493 0.3210620 0.01451841
## 1.000000e-05 20 22.90964 0.3098953 0.3353790 0.01543189
## 1.000000e-05 23 22.91341 0.3096724 0.3382143 0.01558103
## 2.782559e-05 2 24.16441 0.2321921 0.2759646 0.01442891
## 2.782559e-05 4 23.22468 0.2907670 0.3494561 0.01535091
## 2.782559e-05 6 23.08243 0.2994412 0.3349121 0.01571245
## 2.782559e-05 9 23.04479 0.3017098 0.3377501 0.01598844
## 2.782559e-05 11 22.99470 0.3047225 0.3331914 0.01467417
## 2.782559e-05 13 22.96824 0.3063258 0.3276323 0.01504087
## 2.782559e-05 16 22.95721 0.3069986 0.3194955 0.01454911
## 2.782559e-05 18 22.93838 0.3081492 0.3210652 0.01451862
## 2.782559e-05 20 22.90963 0.3098954 0.3353819 0.01543203
## 2.782559e-05 23 22.91341 0.3096724 0.3382184 0.01558111
## 7.742637e-05 2 24.16441 0.2321921 0.2759624 0.01442890
## 7.742637e-05 4 23.22468 0.2907670 0.3494479 0.01535079
## 7.742637e-05 6 23.08243 0.2994412 0.3349031 0.01571230
## 7.742637e-05 9 23.04479 0.3017098 0.3377411 0.01598834
## 7.742637e-05 11 22.99470 0.3047224 0.3331866 0.01467412
## 7.742637e-05 13 22.96824 0.3063255 0.3276395 0.01504137
## 7.742637e-05 16 22.95721 0.3069984 0.3195030 0.01454966
## 7.742637e-05 18 22.93838 0.3081488 0.3210741 0.01451923
## 7.742637e-05 20 22.90963 0.3098955 0.3353900 0.01543244
## 7.742637e-05 23 22.91341 0.3096726 0.3382302 0.01558152
## 2.154435e-04 2 24.16441 0.2321921 0.2759565 0.01442889
## 2.154435e-04 4 23.22468 0.2907669 0.3494250 0.01535045
## 2.154435e-04 6 23.08243 0.2994411 0.3348783 0.01571190
## 2.154435e-04 9 23.04479 0.3017096 0.3377162 0.01598808
## 2.154435e-04 11 22.99470 0.3047223 0.3331734 0.01467399
## 2.154435e-04 13 22.96826 0.3063245 0.3276597 0.01504276
## 2.154435e-04 16 22.95722 0.3069976 0.3195241 0.01455118
## 2.154435e-04 18 22.93839 0.3081479 0.3210988 0.01452090
## 2.154435e-04 20 22.90961 0.3098958 0.3354125 0.01543355
## 2.154435e-04 23 22.91339 0.3096732 0.3382628 0.01558268
## 5.994843e-04 2 24.16441 0.2321921 0.2759400 0.01442885
## 5.994843e-04 4 23.22468 0.2907668 0.3493614 0.01534951
## 5.994843e-04 6 23.08243 0.2994410 0.3348093 0.01571076
## 5.994843e-04 9 23.04479 0.3017091 0.3376469 0.01598736
## 5.994843e-04 11 22.99469 0.3047221 0.3331366 0.01467363
## 5.994843e-04 13 22.96830 0.3063217 0.3277163 0.01504665
## 5.994843e-04 16 22.95725 0.3069955 0.3195833 0.01455544
## 5.994843e-04 18 22.93843 0.3081453 0.3211678 0.01452558
## 5.994843e-04 20 22.90957 0.3098975 0.3354735 0.01543562
## 5.994843e-04 23 22.91334 0.3096756 0.3383560 0.01558474
## 1.668101e-03 2 24.16442 0.2321921 0.2758943 0.01442873
## 1.668101e-03 4 23.22469 0.2907662 0.3491848 0.01534691
## 1.668101e-03 6 23.08244 0.2994405 0.3346179 0.01570762
## 1.668101e-03 9 23.04479 0.3017077 0.3374545 0.01598535
## 1.668101e-03 11 22.98837 0.3051108 0.3341303 0.01527012
## 1.668101e-03 13 22.96843 0.3063137 0.3278780 0.01505759
## 1.668101e-03 16 22.96027 0.3068181 0.3193561 0.01464686
## 1.668101e-03 18 22.93854 0.3081377 0.3213624 0.01453863
## 1.668101e-03 20 22.90949 0.3098997 0.3356359 0.01544221
## 1.668101e-03 23 22.91329 0.3096767 0.3385955 0.01559395
## 4.641589e-03 2 24.16448 0.2321921 0.2757684 0.01442841
## 4.641589e-03 4 23.22483 0.2907642 0.3486969 0.01533966
## 4.641589e-03 6 23.08257 0.2994387 0.3340906 0.01569890
## 4.641589e-03 9 23.04490 0.3017030 0.3369241 0.01597991
## 4.641589e-03 11 22.98846 0.3051095 0.3338080 0.01527004
## 4.641589e-03 13 22.97542 0.3058984 0.3417399 0.01562629
## 4.641589e-03 16 22.96060 0.3068010 0.3198274 0.01467872
## 4.641589e-03 18 22.93898 0.3081144 0.3219154 0.01457510
## 4.641589e-03 20 22.90952 0.3098919 0.3360699 0.01545568
## 4.641589e-03 23 22.91310 0.3096869 0.3395502 0.01563514
## 1.291550e-02 2 24.16501 0.2321921 0.2754286 0.01442752
## 1.291550e-02 4 23.22593 0.2907544 0.3473658 0.01531957
## 1.291550e-02 6 23.08366 0.2994297 0.3326622 0.01567485
## 1.291550e-02 9 23.02771 0.3028091 0.3463873 0.01652423
## 1.291550e-02 11 22.98949 0.3050983 0.3329332 0.01526908
## 1.291550e-02 13 22.97624 0.3058990 0.3404989 0.01561102
## 1.291550e-02 16 22.96235 0.3067443 0.3211597 0.01476606
## 1.291550e-02 18 22.93745 0.3082666 0.3287773 0.01494385
## 1.291550e-02 20 22.91138 0.3098055 0.3374303 0.01550510
## 1.291550e-02 23 22.91409 0.3096441 0.3391623 0.01562220
## 3.593814e-02 2 24.16899 0.2321920 0.2745598 0.01442513
## 3.593814e-02 4 23.23396 0.2906973 0.3438570 0.01526427
## 3.593814e-02 6 23.09181 0.2993769 0.3289682 0.01560958
## 3.593814e-02 9 23.03531 0.3027417 0.3427833 0.01649648
## 3.593814e-02 11 22.99625 0.3050558 0.3312629 0.01524488
## 3.593814e-02 13 22.97477 0.3060977 0.3347229 0.01539609
## 3.593814e-02 16 22.96066 0.3069590 0.3271624 0.01546492
## 3.593814e-02 18 22.94541 0.3078998 0.3414731 0.01566311
## 3.593814e-02 20 22.92076 0.3094149 0.3391434 0.01558767
## 3.593814e-02 23 22.92076 0.3094149 0.3391434 0.01558767
## 1.000000e-01 2 24.19607 0.2321918 0.2726591 0.01441895
## 1.000000e-01 4 23.24676 0.2907311 0.3397342 0.01518991
## 1.000000e-01 6 23.11912 0.2985650 0.3529153 0.01679539
## 1.000000e-01 9 23.04864 0.3026729 0.3419578 0.01656237
## 1.000000e-01 11 23.01178 0.3049311 0.3320891 0.01537461
## 1.000000e-01 13 23.00910 0.3050803 0.3324609 0.01557251
## 1.000000e-01 16 23.00910 0.3050803 0.3324609 0.01557251
## 1.000000e-01 18 23.00910 0.3050803 0.3324609 0.01557251
## 1.000000e-01 20 23.00910 0.3050803 0.3324609 0.01557251
## 1.000000e-01 23 23.00910 0.3050803 0.3324609 0.01557251
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were k = 20 and lambda = 0.001668101.
plot(ridge_model2)
plot(varImp(ridge_model2))
Stepwise selection, ridge regression, and the lasso are just a few of the techniques at our disposal for reducing model size. See this page for a list of other options to try out if you like.
lasso_model = train(rentals ~ .,
data = na.omit(train),
method = 'lasso',
preProcess = c('scale', 'center'),
tuneLength = 10,
trControl = trainControl(method = 'cv', number = 5))
print(lasso_model)
## The lasso
##
## 17544 samples
## 23 predictor
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 14035, 14036, 14036, 14035, 14034
##
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared RMSE SD Rsquared SD
## 0.1000000 25.29124 0.2329375 1.3265385 0.03973063
## 0.1888889 24.42886 0.2699559 0.9687530 0.01940618
## 0.2777778 23.78761 0.2882062 0.7871721 0.01335550
## 0.3666667 23.36073 0.2938053 0.7266509 0.01269411
## 0.4555556 23.13016 0.2999341 0.7261290 0.01280352
## 0.5444444 23.02360 0.3038177 0.7314222 0.01334846
## 0.6333333 22.98007 0.3058203 0.7354855 0.01351454
## 0.7222222 22.95372 0.3073065 0.7426890 0.01381991
## 0.8111111 22.93304 0.3085224 0.7490337 0.01421347
## 0.9000000 22.91961 0.3093079 0.7539243 0.01452186
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.
plot(lasso_model)
plot(varImp(lasso_model))
plot(lasso_model$finalModel)
# get the model coefficients
lasso_coefs = predict(lasso_model$finalModel, type = 'coef', mode = 'norm')$coefficients
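A selling point of the lasso is that it can shrink coefficients all the way to zero, dropping those predictors entirely. We can check which (if any) were zeroed out at the cross-validated fraction (a sketch; `s = 0.9` matches the `fraction` that `train()` selected above, and `final_lasso` is our name):

```r
# coefficients at the chosen fraction of the full L1 norm
final_lasso = predict(lasso_model$finalModel, type = 'coef',
                      mode = 'fraction', s = 0.9)$coefficients

# predictors the lasso dropped entirely
names(final_lasso)[final_lasso == 0]
```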
All right, now we’ve got a nice collection of models. Which one should we report?
results = resamples(list(forward_selection = forward_model,
backward_selection = backward_model,
hybrid_selection = hybrid_model,
ridge_regression = ridge_model,
lasso_regression = lasso_model))
# compare RMSE and R-squared
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: forward_selection, backward_selection, hybrid_selection, ridge_regression, lasso_regression
## Number of resamples: 5
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## forward_selection 24.06 24.37 25.22 24.86 25.26 25.41 0
## backward_selection 24.29 24.88 24.89 24.86 25.05 25.20 0
## hybrid_selection 24.17 24.63 24.80 24.86 25.03 25.66 0
## ridge_regression 22.00 22.43 23.26 22.92 23.37 23.54 0
## lasso_regression  22.14 22.59 22.79 22.91 22.86 24.18 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## forward_selection 0.1514 0.1920 0.1939 0.1877 0.1951 0.2059 0
## backward_selection 0.1770 0.1864 0.1891 0.1872 0.1913 0.1920 0
## hybrid_selection 0.1768 0.1868 0.1880 0.1873 0.1914 0.1934 0
## ridge_regression 0.2728 0.3120 0.3137 0.3092 0.3204 0.3270 0
## lasso_regression  0.2985 0.3003 0.3094 0.3097 0.3168 0.3236 0
# plot results
dotplot(results)
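The dotplot makes the selection methods look clearly worse than the shrinkage methods, but `resamples` objects also support formal pairwise comparisons: `diff()` computes the difference in RMSE and R-squared between each pair of models across the shared CV folds (a sketch, assuming `results` from above; `model_diffs` is our name):

```r
# pairwise differences between models across the shared CV folds,
# with paired t-tests on each difference
model_diffs = diff(results)
summary(model_diffs)

# visualize the pairwise comparisons
dotplot(model_diffs)
```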
Those statistics all come from resampling the training data, however, so if we want to compare the models' out-of-sample prediction accuracy, we need to compute the RMSE using the test data we held out. Let's compare two models, backward selection and the lasso:
backward_predictions = predict(backward_model, test)
sqrt(mean((backward_predictions - test$rentals)^2 , na.rm = TRUE))
## [1] 22.3518
lasso_predictions = predict(lasso_model, test)
sqrt(mean((lasso_predictions - test$rentals)^2 , na.rm = TRUE))
## [1] 22.34778
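Rather than repeating that RMSE arithmetic for each candidate, we can wrap it in a helper and loop over all five models at once (a sketch; the `rmse` function and the `sapply` call are ours, not part of caret):

```r
# root-mean-square error of a model's predictions on new data
rmse = function(model, newdata) {
  preds = predict(model, newdata)
  sqrt(mean((preds - newdata$rentals)^2, na.rm = TRUE))
}

# test-set RMSE for every candidate model
sapply(list(forward = forward_model,
            backward = backward_model,
            hybrid = hybrid_model,
            ridge = ridge_model,
            lasso = lasso_model),
       rmse, newdata = test)
```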
Check out this list of different model selection methods and try a couple out.
Once you’ve spent some time exploring candidate models, pick one and use it in your report.